feat(adf): subscription-based provider routing with persona and skill chains by AlexMikhalev · Pull Request #705 · terraphim/terraphim-ai

AlexMikhalev · 2026-03-20T15:04:05Z

Summary

Extend orchestrator config with provider/model/fallback routing, ProviderTier enum, persona fields, and skill chain registry (Fixes Fix dependencies between workspace crates #28, Embed default config and attempt to fix tests #29, Fix svelte fetch results from search API - commented out in current build #32, Multiple front end fixes #34)
Subscription guard rejects banned opencode/ Zen prefix at runtime (Fixes Fixed APIs for create article from server and search #31)
Fallback dispatch with circuit breaker (3 failures -> open for 5min -> half-open probe) (Fixes Takes default settings from CARGO_MANIFEST_DIR #30)
Persona identity injection into agent prompts (8 Terraphim personas) (Fixes Docs draft - TBC #33)
Skill chain integration from terraphim-skills (18 skills) and zestic-engineering-skills (12 skills) with SkillResolver (Fixes Multiple front end fixes and backend fix to return empty article list #35, Improve settings handling #36)
OpenCodeEvent NDJSON parser for opencode CLI output (Fixes Add basic README.md #37)
11 integration tests covering provider routing, circuit breaker, subscription guard, NDJSON parsing, skill chains, personas (Fixes cargo fmt #38)
Production orchestrator.toml with 13 agents across Safety/Core/Growth layers (Fixes Create article index works differently when called from Axum server and from ripgrep middleware #39)
4 ADRs: subscription-only providers, four-tier routing, persona identity layer, kimi-for-coding implementation tier

Stats

21 files changed, +4,645 lines
103 tests passing (92 unit + 11 integration)

ADRs

ADR-002: Subscription-only model providers (ban opencode/Zen)
ADR-003: Four-tier model routing (Quick/Deep/Implementation/Oracle)
ADR-004: Terraphim persona identity layer (8 named personas)
ADR-005: kimi-for-coding as implementation tier

Test plan

cargo test -p terraphim_spawner -- 92 unit tests pass
Integration tests -- 11 tests pass (provider routing, circuit breaker, subscription guard, NDJSON, skills, personas)
cargo clippy -p terraphim_spawner -- clean
Rebuild adf binary on bigbox and restart service
Verify agents start with new provider routing

Security hardening for terraphim_rlm crate: 1. Created validation.rs module with: - validate_snapshot_name(): Prevents path traversal attacks - validate_code_input(): Enforces MAX_CODE_SIZE (1MB) limit - validate_session_id(): Validates UUID format - validate_recursion_depth(): Prevents stack overflow - Security constants: MAX_CODE_SIZE, MAX_INPUT_SIZE, MAX_RECURSION_DEPTH 2. Fixed race condition in firecracker.rs: - Changed snapshot counter from read-then-write to atomic write lock - Added validate_snapshot_name() call before snapshot creation - Prevents TOCTOU vulnerability where concurrent snapshots could exceed limit 3. Enhanced mcp_tools.rs: - Added MAX_CODE_SIZE validation for rlm_code tool - Added MAX_CODE_SIZE validation for rlm_bash tool - Returns proper MCP error format for validation failures Refs #426

Refs: PR #426 Features: - fcctl-core to terraphim_firecracker adapter - Sub-500ms VM allocation (267ms measured) - ULID-based VM ID enforcement - Full trait implementation with error preservation - 119 tests passing Validation: All acceptance criteria met

…erTier enum Add provider, fallback_provider, fallback_model, and provider_tier fields to AgentDefinition for subscription-based model routing (ADR-002, ADR-003). Add ProviderTier enum (Quick/Deep/Implementation/Oracle) with per-tier timeout values. Add opencode CLI support in spawner arg inference. All new fields are Optional with serde(default) for backward compatibility. Fixes #28 Refs #29 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add provider allowlist and banned_providers fields to OrchestratorConfig. Default banned list includes "opencode" (Zen pay-per-use proxy, ADR-002). Add validate_provider() method to spawner AgentConfig that rejects any model string starting with a banned prefix, while correctly allowing opencode-go/ (subscription) and kimi-for-coding/ (subscription). Fixes #31 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add persona_name, persona_symbol, persona_vibe, meta_cortex_connections, and skill_chain fields to AgentDefinition for the four-layer identity stack (Persona/Role/SFIA/Skills). All fields Optional/default for backward compatibility. See ADR-004. Fixes #32 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add SpawnRequest struct and spawn_with_fallback() method that: - Validates banned providers before spawn (ADR-002) - Uses ProviderTier timeout values (Quick=30s, Deep=60s, Impl=120s, Oracle=300s) - Retries with fallback provider/model on primary failure - Integrates per-provider circuit breakers (3 failures = open 5 min) - Returns SpawnerError on both primary and fallback failure SpawnRequest avoids circular dependency with terraphim_orchestrator by mirroring needed AgentDefinition fields in the spawner crate. Fixes #30 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add persona_name, persona_symbol, persona_vibe, and meta_cortex_connections to SpawnRequest. The build_persona_prefix() function generates a markdown identity block that is prepended to the agent task prompt when persona fields are configured. Agents without persona config are unaffected. Fixes #33 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add SkillChainRegistry with 31 terraphim-skills and 16 zestic-skills. Provides validate_chain() to verify skill chains and validate_skill_chains() on OrchestratorConfig to validate all agents. Backward compatible -- agents with empty skill_chain pass validation. Fixes #34 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add OpenCodeEvent struct with parsing for opencode run --format json output events (step_start, text, tool_use, step_finish, result). Includes text_content(), total_tokens(), parse_line(), and parse_lines() helper methods for structured output consumption. Fixes #37 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add SkillResolver in skill_resolver.rs that maps skill chain names to actual skill file paths from the terraphim-skills repository. Features: - SkillResolver with registry of terraphim skills (security-audit, code-review, session-search, local-knowledge, git-safety-guard, devops, disciplined-research, architecture, disciplined-design, requirements-traceability, testing, acceptance-testing, documentation, md-book, implementation, rust-development, visual-testing, quality-gate) - resolve_skill_chain() method that takes Vec<skill_name> and returns resolved skill metadata (name, description, applicable_to, paths, source) - Validation of skill chains with proper error handling - Comprehensive test coverage including valid resolution, missing skill errors, and empty chain handling - Exported from crate as SkillResolver, SkillSource, ResolvedSkill, SkillResolutionError Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…h Refs #36 Extend SkillResolver to support zestic-engineering-skills from zestic-ai/6d-prompts repository alongside existing terraphim-skills. Features: - Add SkillSource enum (Terraphim, Zestic) to distinguish skill origins - Initialize zestic skills registry with 12 skills: quality-oversight, responsible-ai, insight-synthesis, perspective-investigation, product-vision, wardley-mapping, business-scenario-design, rust-mastery, cross-platform, frontend, via-negativa-analysis, strategy-execution - Each resolved skill includes its source metadata - SkillChainRegistry validation accepts skills from both sources - Comprehensive tests for mixed skill chains (terraphim + zestic together) Tests added: - test_resolver_has_zestic_skills - test_resolve_zestic_skill - test_resolve_mixed_skill_chain - test_validate_mixed_skill_chain_valid - test_validate_only_zestic_skills - test_mixed_chain_with_invalid_skills - test_zestic_skill_source_in_resolved - test_all_skill_names_includes_zestic - test_zestic_skill_structure - test_resolver_custom_paths_for_zestic Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…efs #38

…routing Refs #39 Replace all legacy codex CLI references with opencode + subscription providers. Add persona, skill_chain, provider/model/fallback fields to all agents. New agents: compliance-watchdog, drift-detector, spec-validator, test-guardian, documentation-generator, implementation-swarm, compound-review. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

SSH non-interactive shells (systemd) do not include ~/.bun/bin or ~/.local/bin in PATH. Use full paths for opencode and claude binaries. Also fix TOML escape in --since quote. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Creates crates/terraphim_judge_evaluator with: - SimpleAgent struct wrapping terraphim_router for KG lookups - KgMatch struct for representing term matches - lookup_terms() method using Aho-Corasick automata - enrich_prompt() for appending KG context to judge prompts - Comprehensive tests for all functionality Also adds automation/judge/model-mapping.json configuration file.

Implements JudgeModelRouter for tier-based LLM model selection: - TierConfig struct for provider+model pairs - ModelMappingConfig for complete configuration - JudgeModelRouter with methods: * from_config(path) - load from JSON file * resolve_tier(tier) - get provider/model for a tier * resolve_profile(profile) - get tier sequence for a profile - Supports tiers: quick, deep, tiebreaker, oracle - Supports profiles: default, thorough, critical, exhaustive - Comprehensive tests for all functionality - Updates automation/judge/model-mapping.json with profile definitions

Implements JudgeAgent as a supervised agent combining: - SimpleAgent for Knowledge Graph context enrichment - JudgeModelRouter for tier-based model selection - SupervisedAgent trait from terraphim_agent_supervisor JudgeVerdict struct with: - verdict: final evaluation result (PASS/FAIL/NEEDS_REVIEW) - scores: BTreeMap of detailed category scores - judge_tier: which tier produced the verdict - judge_cli: command used for evaluation - latency_ms: evaluation duration Evaluation pipeline: 1. Load file content 2. Enrich with KG context (optional) 3. Select model tier based on profile 4. Parse verdict from CLI output (JSON or text format) Includes comprehensive tests for: - Verdict parsing and manipulation - Supervised agent lifecycle - Profile-based tier selection - System message handling

- Add stagger_delay_ms config field (default: 5000ms) - Insert stagger delay between Safety agent spawns in run() - Add random jitter (0 to stagger_delay_ms) for Core agent cron spawns - Add tests for stagger delay configuration Refs #16

- Add ReviewRequest struct with from_agent, to_agent, artifact_path, review_type - Add ReviewPair config for defining (producer, reviewer) pairs - Add review_queue: Vec<ReviewRequest> to orchestrator state - Add submit_review_request(), review_queue(), process_review_queue() methods - Add check_review_trigger() to automatically queue reviews on agent completion - Add tests for review queue operations and config loading Refs #17

- Add drift_detection module with DriftDetector and DriftReport - Load strategic goals from plans/ directory markdown files - Check drift every N ticks (configurable via drift_detection.check_interval_ticks) - Calculate drift score by comparing agent outputs against strategic goals - DriftReport includes agent, drift_score, and explanation - Log warnings when drift_score exceeds threshold (default: 0.6) - Add DriftDetectionConfig to OrchestratorConfig - Add tests for drift detection functionality Refs #18

- Add session_rotation module with SessionRotationManager and AgentSession - Add SessionRotationConfig with max_sessions_before_rotation (default: 10) - Track completions_since_rotation and completed_sessions per agent - on_agent_completion() records completion and triggers rotation at threshold - Rotation creates new session ID and clears accumulated context - Add comprehensive tests for session rotation Refs #19

- Add convergence_detector module with ConvergenceDetector and ConvergenceSignal - Add ConvergenceConfig with threshold (default: 0.95) and consecutive_threshold (default: 3) - Calculate output similarity using Jaccard index on word sets - Detect convergence after N consecutive similar outputs - ConvergenceSignal includes agent, similarity, and consecutive_count - Reset on divergence to handle changing outputs - Add comprehensive tests for convergence detection Refs #20

…efs #23

…iven mode Issue #8: Extend orchestrator config for issue-driven mode - Add WorkflowConfig with mode (time_only/issue_only/dual), poll_interval_secs, max_concurrent_tasks - Add TrackerConfig with tracker_type (gitea/linear), url, token_env_var, owner, repo - Add ConcurrencyConfig with max_parallel_agents, queue_depth, starvation_timeout_secs - All optional in orchestrator.toml with backward compatible defaults (time_only mode is default) - Tests: parse config with and without workflow section, defaults applied Issue #9: Unified dispatcher - Create crates/terraphim_orchestrator/src/dispatcher.rs - DispatchTask enum: TimeTask(agent_name, schedule), IssueTask(agent_name, issue_id, priority) - DispatchQueue: priority queue backed by BinaryHeap with round-robin fairness - ConcurrencyController: semaphore-based with starvation timeout - Methods: submit(task), next() -> Option<DispatchTask>, active_count(), is_full() - Fairness: alternates between time and issue tasks at equal priority - Tests: submit/dequeue, priority ordering, concurrency limits, fairness Issue #10: Issue mode controller - Create crates/terraphim_orchestrator/src/issue_mode.rs - IssueMode struct using terraphim_tracker::GiteaTracker to poll for issues - Poll loop every poll_interval_secs fetching ready issues via PageRank sorting - Filter out blocked issues and already-running tasks - Map issues to agents based on labels ([ADF] -> implementation-swarm) or title patterns - Submit IssueTask to DispatchQueue - Tests: poll cycle, issue-to-agent mapping, priority calculation, blocked issue filtering Issue #11: Time mode refactor - Refactor crates/terraphim_orchestrator/src/scheduler.rs - Extract TimeMode struct wrapping existing cron scheduler - TimeMode submits TimeTask to DispatchQueue instead of spawning directly (when configured) - Maintain backward compatibility: legacy mode spawns directly if no WorkflowConfig - Tests: TimeMode submits to queue, legacy mode still works Refs #8 #9 #10 #11

- Add ModeCoordinator that manages TimeMode and IssueMode - Implement unified shutdown: signal modes, drain queue, wait for active tasks - Extend spawner integration to dispatch tasks from queue to agents - Add stall detection: log warning when queue exceeds threshold - Add comprehensive tests for dual mode, shutdown coordination, and stall detection Refs #12

Create comprehensive E2E test suite in tests/e2e_tests.rs: - test_dual_mode_operation: verify both time and issue tasks processed - test_time_mode_only: legacy config compatibility - test_issue_mode_only: issue-only config verification - test_fairness_under_load: no starvation between task types - test_graceful_shutdown: clean termination with queue draining - test_stall_detection: warning logged when queue exceeds threshold - Additional tests for concurrency limits, prioritization, and backward compatibility Uses mock tracker and avoids real API calls. Refs #13

- Create src/compat.rs: Symphony compatibility layer with type aliases, adapters - Add migration helpers for enabling dual mode from legacy configs - Create MIGRATION.md with comprehensive migration guide - Update CLAUDE.md with dual mode architecture description - Include backward compatibility notes and configuration examples - Export compat module from lib.rs Refs #14

- Run cargo test --workspace (orchestrator tests pass) - Fix all warnings in orchestrator crate - Create CHANGELOG.md entry for v1.9.0 release - Verify backward compatibility with test_backward_compatibility - Build release binary: cargo build --release -p terraphim_orchestrator - All 142 tests pass in orchestrator crate Release includes: - Dual mode orchestrator with TimeMode and IssueMode - Unified dispatch queue with fairness - Stall detection and graceful shutdown - Symphony compatibility layer - Comprehensive E2E test suite - Migration documentation Refs #15

- Fix needless borrows for generic args in workspace crate - Apply consistent code formatting across modified files

AlexMikhalev

PR #705 Review: Subscription-Based Provider Routing

Reviewer: Automated code review
Verdict: Changes requested -- several issues need addressing before merge

File Hygiene (HIGH)

117 of 205 changed files are NOT relevant to this PR:

101 files: terraphim_server/dist/assets/bulmaswatch/ -- CSS theme source files that should not be in version control (or at minimum should not be in this feature PR)
5 files: terraphim_server/dist/assets/*.js -- compiled frontend bundles
8 files: reports/upstream-sync-*.md, reports/coordination-*.md, reports/security-*.md -- ephemeral agent-generated reports
3 files: Root-level upstream-sync-*.md, coordination-*.md -- duplicates of the above
1 file: .deployment-marker -- deployment state file
1 file: ~/.config/terraphim/sessions/telegram_287867183.jsonl -- personal session data leaked into PR

Action: Remove all unrelated files from this PR. The dist/ directory should be in .gitignore. The reports belong in a separate artifact store or CI output.

Architecture (MEDIUM)

Good

Clean three-crate split: terraphim_spawner (process management), terraphim_orchestrator (scheduling), terraphim_tracker (issue tracking)
Circuit breaker design (Closed -> Open after 3 failures -> Half-Open after 5min) is sound
Four-tier provider routing (Quick/Deep/Implementation/Oracle) with fallback chains is well-structured
ADRs are clear and well-motivated

Issues

Duplicated ProviderTier enum across terraphim_spawner and terraphim_orchestrator. Extract to terraphim_types to prevent drift.
SkillResolver hardcodes all 30 skills (18 terraphim + 12 zestic) as string literals in Rust code. Adding a skill requires recompile. The filesystem paths are configured but not used for discovery.
Circuit breaker state is in-memory only -- resets on process restart. Acceptable for now but document the trade-off.

Security (CRITICAL + MEDIUM)

CRITICAL: `PermittedProviderFilter` uses substring matching

The filter uses terraphim_automata::find_matches() which performs Aho-Corasick substring matching. A provider string like "not-opencode-go" would match "opencode-go" in the thesaurus and pass validation. A malicious provider name containing a permitted substring would bypass the guard.

Fix: Switch to exact match via HashSet::contains on provider names, or validate that the match spans the entire input string. Automata-based substring matching is the wrong tool for an allowlist check.

MEDIUM: Path traversal in `sfia_metaprompt`

At lib.rs:744, sfia_metaprompt is joined to working_dir and read from disk without sanitization:

let full_path = working_dir.join(metaprompt_path);
std::fs::read_to_string(&full_path)

If config ever comes from an untrusted source, ../../etc/passwd would read arbitrary files. Add path canonicalization and verify the resolved path stays within working_dir.

Code Quality (HIGH + MEDIUM)

HIGH: `Mutex::unwrap()` in production code (`health.rs`)

Lines 335, 341, 348, 354, 372 all use .lock().unwrap(). A poisoned mutex will panic the health check loop. Use .lock().unwrap_or_else(|e| e.into_inner()) or switch to parking_lot::Mutex (no poisoning).

MEDIUM: Dead `mpsc` channel in `OutputCapture`

At output.rs:47:

let (event_sender, _event_receiver) = mpsc::channel(100);

The receiver is immediately dropped. All subsequent event_sender.send() calls silently fail. Either remove the channel entirely (broadcast is the real delivery) or wire up the receiver.

MEDIUM: Thesaurus cloned on every `is_permitted()` call

At lib.rs:136, thesaurus.clone() is called for every spawn request validation. If this PR switches to HashSet (per the security fix), this becomes moot.

Testing (MEDIUM)

Good

92 unit + 11 integration tests is solid coverage
Tests for persona injection (full/partial/none), skill chains, NDJSON parsing, subscription guard
Using echo/cat as mock agent binaries is a pragmatic approach

Missing

No half-open/recovery test for CircuitBreaker -- only Closed->Open is tested. The cooldown timer expiry and success-in-half-open paths are untested.
claude_session_tests.rs re-implements types locally instead of importing from the crate. If production parsing diverges from these local structs, tests will pass while production breaks.
Weak assertion in test_parse_malformed_json_returns_none:

result.is_none() || input.is_empty()

The || input.is_empty() clause makes the assertion trivially true for empty strings.

Config (MEDIUM)

`orchestrator.toml` has hardcoded absolute paths

Multiple references to /home/alex/terraphim-ai, /home/alex/.local/bin/claude, /home/alex/.bun/bin/opencode, /opt/ai-dark-factory/. These only work on one machine.

Fix: Use environment variable substitution ($HOME, $TERRAPHIM_DIR) or make paths relative to working_dir.

Dependencies (LOW)

New deps (nix, regex, chrono, uuid, cron) are all justified
terraphim_automata is heavy for what is essentially a set membership check in PermittedProviderFilter -- if switching to HashSet, this dep can be removed from spawner

Summary of Required Actions

Priority	Issue	Action
CRITICAL	Substring matching in provider filter	Switch to exact match
HIGH	117 unrelated files in PR	Remove from branch, update `.gitignore`
HIGH	`Mutex::unwrap()` in `health.rs`	Use poison-safe locking
MEDIUM	Dead mpsc channel in `OutputCapture`	Remove or wire up
MEDIUM	Duplicated `ProviderTier` enum	Extract to `terraphim_types`
MEDIUM	Path traversal in `sfia_metaprompt`	Canonicalize and validate
MEDIUM	No circuit breaker half-open test	Add recovery test
MEDIUM	Hardcoded paths in orchestrator.toml	Use env vars or relative paths
LOW	Hardcoded skill registry	Load from filesystem

Good foundational work on the provider routing and persona system. The architecture is sound and the ADRs are well-written. Addressing the security issue with substring matching and cleaning up the file hygiene would make this ready for merge.

AlexMikhalev and others added 30 commits March 17, 2026 16:24

feat(spawner): add integration tests for opencode provider dispatch R…

812555c

…efs #38

fix(ops): use absolute CLI paths in orchestrator.toml

708500c

SSH non-interactive shells (systemd) do not include ~/.bun/bin or ~/.local/bin in PATH. Use full paths for opencode and claude binaries. Also fix TOML escape in --since quote. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat(judge): add parallel batch evaluation via ExecutionCoordinator R…

67d3b59

…efs #23

feat(judge): build judge-evaluator CLI binary Refs #27

bed0812

feat(spawner): add integration tests for ClaudeCodeSession Refs #5

0199a91

feat(workspace): extract terraphim_workspace crate Refs #6

2c0a53a

feat(tracker): extract terraphim_tracker crate Refs #7

d57bc27

AlexMikhalev and others added 3 commits March 20, 2026 20:55

style: fix clippy warnings and formatting

3c91da3

- Fix needless borrows for generic args in workspace crate - Apply consistent code formatting across modified files

AlexMikhalev commented Mar 24, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(adf): subscription-based provider routing with persona and skill chains#705

feat(adf): subscription-based provider routing with persona and skill chains#705
AlexMikhalev wants to merge 33 commits intomainfrom
task/28-provider-model-fallback-schema

AlexMikhalev commented Mar 20, 2026

Uh oh!

AlexMikhalev left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

AlexMikhalev commented Mar 20, 2026

Summary

Stats

ADRs

Test plan

Uh oh!

AlexMikhalev left a comment

Choose a reason for hiding this comment

PR #705 Review: Subscription-Based Provider Routing

File Hygiene (HIGH)

Architecture (MEDIUM)

Good

Issues

Security (CRITICAL + MEDIUM)

CRITICAL: PermittedProviderFilter uses substring matching

MEDIUM: Path traversal in sfia_metaprompt

Code Quality (HIGH + MEDIUM)

HIGH: Mutex::unwrap() in production code (health.rs)

MEDIUM: Dead mpsc channel in OutputCapture

MEDIUM: Thesaurus cloned on every is_permitted() call

Testing (MEDIUM)

Good

Missing

Config (MEDIUM)

orchestrator.toml has hardcoded absolute paths

Dependencies (LOW)

Summary of Required Actions

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

CRITICAL: `PermittedProviderFilter` uses substring matching

MEDIUM: Path traversal in `sfia_metaprompt`

HIGH: `Mutex::unwrap()` in production code (`health.rs`)

MEDIUM: Dead `mpsc` channel in `OutputCapture`

MEDIUM: Thesaurus cloned on every `is_permitted()` call

`orchestrator.toml` has hardcoded absolute paths